Stockwell transform and semi-supervised feature selection from deep features for classification of BCI signals

Over the past few years, the processing of motor imagery (MI) electroencephalography (EEG) signals has been attracted for developing brain-computer interface (BCI) applications, since feature extraction and classification of these signals are extremely difficult due to the inherent complexity and tendency to artifact properties of them. The BCI systems can provide a direct interaction pathway/channel between the brain and a peripheral device, hence the MI EEG-based BCI systems seem crucial to control external devices for patients suffering from motor disabilities. The current study presents a semi-supervised model based on three-stage feature extraction and machine learning algorithms for MI EEG signal classification in order to improve the classification accuracy with smaller number of deep features for distinguishing right- and left-hand MI tasks. Stockwell transform is employed at the first phase of the proposed feature extraction method to generate two-dimensional time–frequency maps (TFMs) from one-dimensional EEG signals. Next, the convolutional neural network (CNN) is applied to find deep feature sets from TFMs. Then, the semi-supervised discriminant analysis (SDA) is utilized to minimize the number of descriptors. Finally, the performance of five classifiers, including support vector machine, discriminant analysis, k-nearest neighbor, decision tree, random forest, and the fusion of them are compared. The hyperparameters of SDA and mentioned classifiers are optimized by Bayesian optimization to maximize the accuracy. The presented model is validated using BCI competition II dataset III and BCI competition IV dataset 2b. The performance metrics of the proposed method indicate its efficiency for classifying MI EEG signals.

is still challenging to accurately classify different MI states. Therefore, MI activity has been utilized for the BCI system in this work, with our goal to improve the classification performance with smaller number of features for MI tasks using three-step feature extraction technique.
Feature extraction and classification are the two salient factors in MI EEG signal processing. The analysis of the EEG signals begins with identifying their informative features. Typical spatial pattern (CSP) and CSP-based methods are popular feature extraction techniques in various MI studies [12][13][14] . Authors in Ref. 15 have used the filter bank CSP (FBCSP) algorithm along with the principle component analysis (PCA) to select and reduce features from EEG signals which then are classified by the eXtreme gradient boosting (XGBoost) algorithm. Also, there have been several studies that use graph theory and functional connectivity to analyze EEG signals in MI tasks 16 . In another study, a frequency-based approach using CSP features from overlapping sub-bands was proposed for MI classification. Using all available channels, the method selects the most discriminating filter banks 17 . A number of studies have also examined the effectiveness of time-domain, frequency-domain, and the fusion of both information on the performance of MI EEG classification 18 . Recently, RNN-based metaheuristic algorithms, time-varying equations are applied to the control of robotic 19 , where an artificial dynamic system based on EMG signals and joint information was introduced to detect human motion intention in lower body parts. Also, neural network models have been used for and time-varying optimization problems 20 . Using a combination of RNN and CNN architectures, the work in Ref. 21 classified a four-class MI on the BCI competition IV dataset 2a with the goal of having a model that could be applied to all participants. However, the performance of current studies in MI-EEG classification is still not comparable to other fields like image and speech recognition. The shorttime Fourier transform (STFT) and the wavelet transform are also popular time-frequency approaches, which have been developed to extract the various EEG frequency characteristics over time 12,22,23 . In another reported study 24 , the STFT features of the input signals were extracted and then classified using a network based on ResNet. However, the limited width of the window in STFT results in constant resolution in both time and frequency domains; hence, it cannot provide proper frequency resolution at low frequency and good time resolution at high frequency. Several studies indicated that continuous wavelet transform (CWT) with variant mother wavelets represents appropriate multi-scale analysis for extracting significant features in the time-frequency resolution over MI EEG signals in BCI tasks [25][26][27] . Various machine learning methods have been employed to classify MI EEG signals, such as support vector machine (SVM) 28 , linear discriminant analysis (LDA) 29 , k-nearest neighbor (kNN) 30 , and other methods 23,31 . Deep learning models such as convolutional neural networks (CNNs) have been recently used in the BCI studies [32][33][34] .
In Ref. 27 , the authors considered the CWT and a four-layer CNN for classification. They improved average classification accuracy using three mother wavelets compared to the STFT on BCI competition II dataset III and BCI competition IV dataset 2b. In Ref. 33 , different mother wavelets were presented for time-frequency mapping of the EEG signals. Then a two-layer CNN was developed to classify a combination of TFMs of C3, Cz, and C4 channels into the left-and right-hand MI tasks. The accuracy rate of their work was 92.75% in dataset III from BCI competition II. Kant et al. 34 converted the EEG signals into two-dimensional TFMs using the CWT. They used dataset III of BCI competition II in three different frequency spectrums and several transfer learning methods, including VGG19, AlexNet, VGG16, ResNet50, GoogleNet, and ResNet101, were applied to classify the MI data. They achieved maximum accuracy of 95.71% in full-band  Hz) by VGG19. Furthermore, time-frequency images obtained by Morlet wavelet transforms in Ref. 35 , were classified using an extended CNN with convolutional block attention modules (CBAM) with an accuracy of 90.7% on the BCI dataset III. The disadvantages of wavelet transform as the feature extraction method in these works are poor time resolution at low frequencies and finding an optimum window function before operation. Stockwell transform was presented to overcome the drawbacks of wavelet transform 36,37 . In Ref. 38 , Stockwell transform divided different MI signals into distinct frequency regions to prepare a distinguished feature vector combined with the CSP technique as a multi-step feature extraction method. The performances of three different classification techniques of least square-SVM (LS-SVM), random forest (RF), and artificial neural network (ANN) were compared. Accordingly, 95.55% accuracy was achieved with the LS-SVM classifier on BCI competition III dataset IIIa.
MI tasks have been classified with several different techniques, but currently, there is no superior algorithm that provides better results for most applications. Instead of using an individual classifier, the ensembles of different base classifiers have shown promising results for BCI 39,40 . Clearly, the quality of an ensemble method can be defined by its accuracy and diversity 41 . In Ref. 40 , a comparative study of three ensemble architectures based on three base machine learning classifiers of kNN, SVM, and Naive Bayes (NB) were represented to classify different feature sets extracted from MI data best performance was reported using Adaboost ensemble learning with multiple base classifiers. In Ref. 42 , a majority voting ensemble model of five individual classifiers [LDA, kNN, SVM, NB, and decision tree (DT)] showed a better average classification accuracy than every single classifier for multi-class motor imagery EEG signals. Although different ensemble learning methods can enhance the overall accuracy, they cannot consistently outperform the best individual classifier for some applications due to the different characteristics of the input datasets 43 .
BCI employs the brain activity for communication of paralyzed people with intact brain functions. However, the non-stationarity nature of brain activity and physiological artifacts contained in brain activity limit the performance and reliability of BCI technologies. Hence, our aim in this is to enhance the performance of MI task classification. Due to the nonlinear characteristics of MI EEG signals, it is preferable to employ time-frequency transforms to analyze these signals. Considering the explanations provided in the literature review, our objective is to improve the classification performance of BCI tasks using a smaller number of deep features and fusion algorithms before deep feature extraction and in decision levels. This paper uses the Stockwell transform to obtain the TFMs of MI EEG signals. Then, CNN is considered to elicit the robust deep features from TFMs. Since too many features have been extracted, they should be reduced to alleviate the computational complexity. To this end, we consider the semi-supervised discriminant analysis (SDA), which maximizes the separation of classes Dataset. The EEG signals for this study was taken from two datasets namely BCI competition II dataset III 44 and BCI competition IV dataset 2b which respectivey refered as II-III and IV-2b 45 . Table 1 summarizes the detail of the datasets. In the following, a detailed description of each dataset will be presented.
The dataset II-III recorded the motor cortex's channels C3, C4 and Cz for a normal subject (a 25-year-old woman). It consists of MI task experiments for the left-and right-hand motions. In total, 280 trials of 9 s length are in the dataset. 140 of them are for training, and 140 are for testing. Following the first two seconds of silence, an acoustic stimulus was given at t = 2 s, followed by the cross "+" display for one second. After that, a cue (left or right) was shown to the subject from t = 3-9 s, and the subject was instructed to perform the imagery task. Each of the trials follows the same pattern as shown in Fig. 2a. The sampling rate was 128 Hz, and the signals were filtered between 0.5 and 30 Hz. Figure 3 presents one recording from each task in different channels.
The three-channel (C3, Cz and C4) EEG signals composing dataset IV-2b were collected from nine subjects 45 under the sampling frequency of 250 Hz. To eliminate the signal noise, a band-pass filter in the range [0.5, 100] Hz is employed. Similar to the dataset II-III, imaginations of left hand movement and right hand movement were perfomred. EEG signals for each subjectwere recorded in five sessions, without feedback in the first two sessions, with feedback in the remaining three sessions. Each trail was recorded as shown in Fig. 2b-c. Time-frequency analysis. It should be mentioned that Motor movements, which are called ERS and ERD in brain activity, occur in the alpha (8)(9)(10)(11)(12)(13) and beta (14-28 Hz) frequency bands, so we considered the output of Stockwell transform in the range 7-30 Hz. Hence, it is not required to remove the effect of the 50 Hz industrial frequency signal from raw EEG signal before computing Stockwell transform. Since EEG signals have nonlinear and non-stationarity characteristics, various time-frequency decomposition methods, such as STFT, wavelet transform, and Stockwell transform, have been conventionally used to analyze them. Due to the fixed window  www.nature.com/scientificreports/ width in the STFT, the proper time and frequency resolution cannot be achieved simultaneously. The wavelet transform was proposed to overcome the problems related to Fourier transform by decomposing data into several scales, and each scale represents a particular resolution of the signals. The drawbacks of wavelet transform are choosing the optimum mother wavelet and losing the absolute phase of the data. The Stockwell transform presented by Stockwell et al. 46 is an extension of CWT and STFT. As an effective and efficient time-frequency decomposition method, the Stockwell transform gives high-frequency resolution at low frequencies while obtaining high time resolution at high frequencies. Therefore, in this study, the Stockwell transform was applied to represent EEG signals in time-frequency. The Stockwell transform of a continuous time-domain signal x(t) is represented as: where j = √ −1 and denotes the CWT of signal x(t) and ω(t, f ) defines the Gaussian mother wavelet as: where the factor d represents the inverse of frequency f (d = 1/f ) . Hence, the expression of the Stockwell transform of the continuous signal x(t) is given as 46 : According to (3), the window width in Stockwell transform depends on the frequency f . Thus, it becomes wider as the frequency decreases, and when the frequency increases, it becomes narrower 47 .
Let assume x(nT) , n = 0, 1, . . . , N − 1 be a discrete-time signal Acquired by sampling the continuous signal x(t) where T is the sampling period. The discrete Stockwell transform is derived from the discrete Fourier transform (DFT) of the input signal. The N-point DFT of the signal can be expressed by www.nature.com/scientificreports/ Stockwell transform is defined in discrete form as being the projection of a vector onto a spanning set 46 . Discretization of (4) results in the discrete Stockwell transform: where G(n, k) = exp − 2π 2 k 2 n 2 represents a Gaussian function and n, m = 0, 1, . . . , N − 1 . The amplitude of Stockwell transform is needed for feature extraction, which is calculated as: It was demonstrated in Ref. 33 that two electrodes placed in C3 and C4 are sufficient for classifying different imagery tasks. Hence, in this paper, the Stockwell transform was performed on signals obtained from C3 and C4 channels, and the corresponding absolute TFMs are shown in Figs. 4 and 5 for the left-and right-hand motions, respectively. The performing or even the imagination of motor movements can arouse specific patterns called event-related synchronization (ERS) and event-related desynchronization (ERD) in the brain activity, which occurs in the alpha (8)(9)(10)(11)(12)(13) and beta (14-28 Hz) frequency ranges 48,49 . Since these phenomena are important in classifying MI EEG signals, a band-pass filter was applied on the raw EEG signals in 7-30 Hz. The TFMs of C3 and C4 electrodes in the range 7-30 Hz are then stacked vertically as shown in Fig. 6. As observed that TFMs of the left-hand and right-hand task are different, we can use them to classify MI tasks.  The input of the CNN is stacked TFMs, and its output is a deep feature vector. The convolution layer is the first layer in CNN to extract features from an input TFM by applying different filters (kernels) and passing results to the pooling layer. Limiting the number of layers and the relevant parameters according to the number of training samples is an appropriate solution to avoid over-fitting and reduce the complexity of the functions 33 .
A mini-batch normalization layer and an activation layer are added after each convolution layer. The main objective of using a batch normalization layer between the convolutional layers is to normalize the outputs of each layer to have zero mean and unit variance, which can accelerate and improve the performance of deep neural networks 50 . The nonlinear activation function introduces nonlinearity to the neural network. There are several kinds of activation functions, and the most used ones are sigmoid, tangent hyperbolic, and rectified linear unit (ReLu) function 33 . ReLu is the most effective and popular activation function, which is defined as: www.nature.com/scientificreports/ Hence, for negative input, the output is equal to 0, and for positive input, it is a linear function. The ReLu function is faster and more straightforward than the previous two. As well as due to considerable variation in the outputs for positive inputs, it prevents the vanishing gradient problem. Accordingly, the ReLu activation function is chosen as the activation layer for the CNN in this paper. The pooling layer is the next layer, which is also called the sub-sampling or down-sampling layer. Max pooling and average pooling are the general pooling functions reducing the dimensions of the data by taking the maximum and the average value in the sampling area.
In this research, CNN with two and three layers are considered to extract deep features from TFMs, where the first and second convolutional layers have eight and 16 kernels, respectively, and the last layer in three-layer CNN has 32 filters. The size of all filters is 3 × 3. The structure of the two-Layer CNN is depicted in Fig. 7.
Another approach to extract deeper features from images is using the pre-trained networks and adjusting their weights for new tasks. There are several pre-trained models for image recognition tasks, such as AlexNet, VGG16, VGG19 51 , Inception 52 , MobileNet 53 , and ResNet50 54 . In this paper, we extract features from the last pooling layer of the pre-trained AlexNet and the second fully connected layers of pre-trained VGG19 models and report their performance in our proposed model.

Feature reduction.
After deep feature extraction, the input TFM is represented by a vector with highdimension. The several features maybe not be informative and have a higher correlation with each other. To select the most significant features and decrease the dimension of the feature vector, SDA is employed. SDA considers both labeled and unlabeled samples 55 . The labeled data points maximize the separation between different classes, while the unknown data estimate the basic geometric structure. A smooth discriminant function is fitted to the distribution of data by SDA.
Suppose that x 1 , x 2 , . . . , x N ǫR L denote the N training samples in L-dimensional space that correspond to c classes. The supervised version of SDA, i.e., linear discriminant analysis (LDA), only considers the labeled sample. LDA has the following objective function: where S w and S b refer to the intra-and inter-class scatter matrices, successively, which are computed as follows: where N k denotes the number of training samples for kth class, µ is the total sample mean vector, µ (k) is the mean vector of class k, and x (k) i is the sample i in class k . By defining the total scatter matrix Thus, the objective function equals: If enough training samples are not available, overfitting may occur. Regularizers are typically used to prevent overfitting. In this case, the optimization problem is as follows: (13) max a T S b a a T S t a + βJ(a) , www.nature.com/scientificreports/ where J(a) determines the learning complexity of the hypothesis family, and the regulation coefficient β controls the balance between complexity of the model and the empirical loss. Considering natural regularizer, we have: where S is the weight matrix defined as: where N p (x i ) stands for the set of p nearest neighbors of x i . D is a diagonal matrix; its entries are column (or row, since S is symmetric) sum of S, D ii = j S ij and L = D − S is the Laplacian matrix. Hence, the objective function of SDA can be formulated as: The objective function is maximized by the projective vector a which is defined by the maximum eigenvalue solution to the generalized eigenvalue problem: , where c is the number of non-zero eigenvalues, samples are embedded as: As observed the performance of SDA depends on the regulation parameter β. In this paper, the Bayesian optimization is employed to find the optimum value of the parameter β which yields in the highest classification accuracy.
Classification. In this paper, five well-known machine learning classifiers were applied to classify two-class feature vectors, and their results are compared. Due to different behavior of classifiers in some cases, a fusion method was used to enhance the reliability of overall classification accuracy by combining the decisions of classifiers.
Support vector machine (SVM). Vapnik 56 introduced the SVM as the robust classifier. Due to its lower computational complexity and easy processing of small datasets, it has been commonly employed in various BCI studies 4,57-59 . The optimal hyperplane in SVM maximizes the marginal distance between classes. In this paper, linear SVM was considered.
Discriminant analysis. Low computation requirement and easy implementation make discriminant analysis one of ideal classifiers for EEG based-BCIs 29,60 . In the discriminant analysis method, the boundary among classes is defined based on maximizing the ratio of inter-class variance and minimizing intra-class variance. The discriminant analysis classification technique uses Bayes' Theorem to predict which class the test data belongs to 61 .

k-Nearest neighbor (kNN).
The kNN approach is a famous statistical method in machine learning-based classification algorithms. The kNN is a simple classifier in MI tasks 59,62 classifies each test data by considering the k distance metrics between the test data and those of the closest classes in the feature space. As a result, the parameter k is an essential key in the performance of the kNN.
Decision tree (DT). DT is a supervised machine learning technique in which a dataset is continuously split into subsets based on a particular parameter. This classifier uses a tree-like structure that contains the root, internal decision, and terminal nodes. The root node is considered as the whole dataset sorted into branches. The intermediate subsets are called decision nodes, and the terminal node shows the predicted classes 63 .
Random forest (RF). The RF is a supervised machine learning classifier proposed by Leo Breiman in 2001 64 . RF classifiers collect decisions of multiple DT classifiers where a random subset of the features is selected to train each DT classifier. This process increases the variation among the trees; hence it overcomes overfitting. Eventually, combining the results of all DTs determines the final decision on new data.
max a a T S b a a T S t + βXLX T a . www.nature.com/scientificreports/ Ensemble of classifiers. The ensemble is the combination of two or more individual classification models to improve the overall performance. A robust ensemble model is based on two essential parameters: the accuracy and diversity of classifiers 41 . In this research, the majority voting ensemble, one of the most popular combination approaches for classification 65 , was used to combine the results of five classifiers for the final decision, as shown in Fig. 8. In this model, the final class prediction is the one that receives more than half of the votes among the base classifiers. Informed consent. All methods were carried out in accordance with relevant guidelines and regulations and were approved department of medical informatics, institute for biomedical engineering, university of technology, Graz, Austria.

Results and discussion
This section reports the results of the conducted experiments. The performance of the proposed model was evaluated through classification accuracy, kappa score, confusion matrix, precision, and sensitivity. The classification accuracy as the most widely used measure defined as 34 : where TP (true positive) is the number of correctly classified feature sets, and TN (true negative) is the number of correctly rejected ones. FN (false negative) is the number of feature sets identified wrongly, and FP (false positive) is the number of wrongly rejected feature sets. The values for all these parameters are derived from the confusion matrix. Sensitivity, also known as recall, is the ability of the model to predict all the true positives of each specific class. It is obtained as 71 : The precision reflects the proportion of accurate positive predictions out of the total number of samples classified as positive: Besides, the kappa score was applied to measure the classification performance of the proposed model and eliminate the randomness effects 72 . It is calculated as follows: where rAcc. denotes the random accuracy, which is defined as  Data preparation. Each raw EEG signal of dataset II-III has a duration of nine seconds. However, the last six seconds of the original EEG signal are considered for MI classification. We consider the six-second duration of the trial and multiple smaller segments within the trial. The objective of sliding time windows within the trial is to discover the most effective time duration in classification accuracy. In this work, three windows with the length of two, three, and four seconds were considered to extract EEG segments from both training and test datasets with a stride of 250 ms. The first segments start from the third second of the original signal, and the last ones finished at the trial end. As an example, segments with three seconds time duration are shown in Fig. 9. The 50% of data was used for training and the remaining data was considered for test phase. The first three sessions of dataset IV-2b were considered in this paper. The MI segment in this dataset has the length of three seconds. Hence, we only consider the two-second sliding windows with a stride of 250 ms. The 50% of data was used for training and the remaining data was considered for test phase.
Feature reduction. The CNN automatically extracts the high-dimensional deep features from each TFM.
All extracted features are not informative, and most of them are redundant. As mentioned, SDA is considered for feature reduction. The size of input feature vector depends on the structure of CNN which equals to 48,400 for proposed two-layer CNN. According to characteristics of SDA, the size of the reduced feature vector equals to the number of classes which is equal to two in this paper. In simulations, 2/3 of training samples are considered labeled data, and the remaining ones are treated as unlabeled data. Simulations show that there are two non-zero eigenvalues; hence, SDA reduces the number of features to two, which reduces the computational complexity considerably. The scatter plot of the features generated by SDA for different lengths and locations of the sliding window is shown in Fig. 10. It is observed that the length of the sliding window and its location has a considerable effect on the distribution of features generated by SDA. Hence, classification accuracy is expected to vary by length and location of the prediction window, shown in the following.

Results of whole MI trials.
We considered the optimization procedure to find the hyperparameters of classifiers. For SVM classifier, the box constraint and kernel type, i.e., linear, quadratic, cubic, or gaussian, are found by Bayesian optimization. In addition, for gaussian kernel, its scale was also optimized. In the case of kNN classifier, number of neighbors, distance metric and distance weight were obtained by Bayesian optimization. Distance metric is chosen from Euclidean, Mahalanobis, cubic and cosine. The weighting scheme is also chosen from equal, inverse, and squared inverse. For decision tree, the maximum number of splits is found by Bayesian optimizer. Gini's diversity index was considered as split criterion and a node in a tree is height-balanced if the heights of its subtrees differ by no more than one. The discriminant type of discriminant classifierwas found among linear, quadratic, diagonal linear, and diagonal quadratic by grid search. Finally, Bayesian optimizer finds the minimum leaf size and number of predictors to sample for random forest classifier.
A comparative study of the proposed model's classification accuracy and kappa score in Tables 2, 3, 4 and 5 for different classifiers. These tables compare the performance of five single classifiers and their fusion with the majority voting method based on deep features extracted by two-and three-layers CNN and pre-trained models, including AlexNet and VGG19. In order to evaluate the effectiveness of the Stockwell transform, the results of Stockwell TFM are compared with the Morlet wavelet transform and STFT, which showed relatively better results than other mother wavelets in recent studies 27,33,73 . www.nature.com/scientificreports/ Table 4 shows that the Morlet wavelet transform has a better average classification accuracy than the Stockwell transform when the pre-trained AlexNet network is applied for extracting deep features. However, the maximum achieved accuracy is still less than the best achieved accuracy using Stockwell transform by other deep CNN models. Most classifiers have achieved comparatively better performance with proposed Stockwell-based features in the classification of EEG signals. The results indicate that in the proposed model, for the dataset II-III, the majority voting classifier has the highest classification accuracy of 97.14% and 86.05%, respectively on datasets   Table 6 presents the confusion matrix, sensitivity, and precision for our proposed fusion model based on the Stockwell transform related to two-layer CNN. It demonstrates the correspondence between the predicted and   Table 6. Confusion matrix for fusion model. www.nature.com/scientificreports/ actual labels for each action class in the considered datasets. As observed, the model's sensitivity for right-hand imagery movements achieved the better rate than that of the left hand.

Predicted labels Precision (%) Sensitivity (%) Left hand (%) Right hand (%)
Classification results of sliding window. Here we discuss the location of the sliding windows on the accuracy of the proposed method. Tables 7, 8 and 9 present the performances of classification methods on three different segments size using CNN with two layers for dataset II-III. Regarding two-second segments, the best accuracy of 98.57% was obtained by kNN and majority voting in 3.75-5.75 s time duration, and the segments extracted from the last two seconds of the trial showed the lowest accuracy rate. Similarly, the results in Table 8 indicate that the SVM, kNN, and majority voting classification algorithms have attained the highest accuracy and kappa value of 99.29% and 0.986, respectively, in the 3.25-6.25 s time duration. In contrast, the lowest accuracy has been mainly achieved for the last segment. Table 9 shows similar results for four-second segments with the highest classification accuracy of 98.57% by the SVM and majority voting classifiers, while the DT classifier reported the minimum amounts in all segments.
Since the length of MI segment in dataset IV-2b is three seconds, we only considered the windows with the length of two seconds. Table 10 summarizes the best, worst and mean accuracies for considered classifiers. It is observed that majority voting achieves the highest accuracy of 89.02% considering the window between 3.25 and 5.25 s. The worst accuracy of 63.59 belongs to the DT and discriminant classifiers in the range 4-6 s. Also, majority voting classifier has the highest average accuracy as 80.78%. Figure 11a depicts the classification accuracy of the classifiers on two-second segments. The results demonstrate that all classification methods have performed comparatively better in classification accuracy and kappa  www.nature.com/scientificreports/ www.nature.com/scientificreports/ value in the beginning seconds of the MI task. Then, the overall classification accuracy is trending downward, where the lowest performance has been yielded in the last segments. A similar trend can be seen in Fig. 11b,c for three-second and four-second time duration segments, respectively. Also, Fig. 11d presents the accuracies of different two-second sliding windows for dataset IV-2b. Therefore, finding the most effective time duration of the signals depends on various factors such as the segment size, the delay in conducting the imagery task according to the cue, and the subject's concentration during the trial. Another finding of this study is that although the majority voting ensemble improves the classification performance in some segments, a minor improvement was observed in the overall accuracy compared with individual classifiers, especially SVM. Therefore, it can be concluded that using a simple machine learning algorithm such as SVM as the final classification method in the proposed model is better than applying the fusion model in terms of accuracy, processing time, and computational complexity.
The confusion matrix of the best-achieved classification accuracy of 99.29% by SVM, kNN and majority voting classifier, which is for 3.25-6.25 s time duration, is given in Table 11. For the dataset IV-2b, the maximum accuracy of 89.02% in the duration of 3.25-5.25 is obtained by majority voting classifier. The overall results demonstrate the efficiency of the proposed model at classifying MI EEG signals.
Effect of feature reduction on accuracy. Here, we evalute the effect of feature reduction on the accuracy of the proposed method. To this end, we compare the performance of the proposed method with other feature reduction schemes such as PCA, locality preserving projection (LPP) 74 , and neighborhood preserving embedding (NPE) 75 . We also presented the accuracy considering the original feature vector. The results are given in Table 12. The results indicate that the SDA considerably enhances the accuracy of classification.
Performance comparison. Various approaches have been proposed to classify MI signals. In order to compare the classification results of BCI competition II dataset III, the best result achieved in this study is compared with other methods found in the existing studies in terms of accuracy (Table 13). The authors in Ref. 76 have proposed STFT-based TFM as input and considered a single layer CNN, stacked autoencoders (SAE), and a combination of them (CNN-SAE) to classify MI EEG signals. They reported classification accuracy of 90% using CNN-SAE on BCI competition II dataset III. In Ref. 33 , a two-layer CNN was developed to classify a combination of TFMs of C3, Cz and C4 channels using different mother wavelets. The best accuracy rate of their work for the current dataset was 92.75% based on the 3.25-6.25 s time duration. In Ref. 77 extracted spatial-temporal features using the multivariate empirical mode decomposition were classified with SVM and achieved 85.2%. Also, higher-order dynamic mode decomposition and multichannel singular spectrum decomposition hybridization were considered in Ref. 78 for feature extraction. The authors in Ref. 27 utilized three various mother wavelets, i.e., Morlet, Bump, and Mexican wavelets, to extract the TFMs. They achieved better classification accuracy using the Bump wavelet for combined mu and beta bands and a one-dimensional CNN as the classification method. In Ref. 29 , a flexible analytic wavelet transform (FAWT) was implemented to decompose MI EEG signals into multiple sub-bands. Then, the reduced statistical features by the multidimensional scaling (MDS) technique were classified using the LDA classifier. The model resulted in 94.29% classification accuracy on dataset II-III.
In Ref. 73 , the magnitude and phase information extracted from CWT images' real and imaginary parts were fed to a one-layer CNN. The proposed method achieved the best 94.6% classification accuracy. The method described in Ref. 34 explored various transfer learning models such as VGG19, AlexNet, VGG16, SqueezeNet, ResNet50, GoogleNet, DenseNet201, ResNet18, and ResNet101 to classify Morse wavelet-based TFMs. The method reached up to 95.71% classification accuracy in the case of VGG19. In Ref. 24 , a new dynamic multi-scale layer was added to the ResNet network to extract the multi-scale characteristics from the STFT features of the input signal. They have obtained an accuracy of 90.47%. The authors in Ref. 35 , employed two CBAMs in a twolayer CNN for classification of the subtraction TFMs of two C3 and C4 channels. Huang and colleagues in Ref. 18   Table 11. Confusion matrix, sensitivity, and precision for best accuracy. In the following, the some papers focused on dataset IV-2b are reviewed. Boltzman machines were employed in Ref. 79 and reaches the accuracy of 84.2%. A combination of spectrogram and scalogram as input TFM given to CNN + LSTM structure was yielded the accuracy of 73.8% 80 . The combination of Hjorth parameters as extracted features, ANOVA for feature selection and SVM for classification reaches the accuracy of 82.58% in Ref. 81 . Dual-tree complex wavelet was used in Ref. 82 to extract the time-frequency component of EEG signals. After selection of efficient features by NCA, the SVM classified the BCI MI EEG signals which yields the accuracy of 84.02%. In Ref. 83 , parallel CNNs were used classify of TFM obtained from STFT and the accuracy of 83% was achieved. The results show that the proposed method with the accuracy of 89.02% outperforms the recently introduced methods.

Predicted labels Precision (%) Sensitivity (%) Left hand (%) Right hand (%)
Most of the mentioned works have incorporated wavelet transform-based approaches to extract the feature of the whole-time duration of the MI EEG signals. While, in the present study, finding the location and duration of the most exciting part of the signal has been investigated in detail, and better accuracy and kappa value have been yielded by the Stockwell transform-based features.

Conclusion
In this paper, a new approach based on Stockwell TFMs of EEG signals was proposed to enhance the classification accuracy and reduce the deep features to classify the left-and right-hand movement imagery. In this study, the Stockwell transform was to decompose the time-frequency content of EEG signals, since it provides better resolution than the others such as wavelet transform and STFT. We considered early fusion scheme and combined the Stockwell transform of different channels before deep feature extraction. Compared to other studies which mainly focused on one specific scheme for the classification stage, we examined different machine learning methods as well as their fusion to cover each other's weaknesses. Four CNN models were used to extract high-dimensional deep features, where the TFMs of C3 and C4 channels in the frequency range of [10 30] Hz were concatenated and considered as input of CNN. Since there are a large number of features extracted by CNN, SDA was employed to reduce them to two. The classification accuracy of different optimized classifiers and a fusion of them by the majority voting method were compared. The whole MI EEG signals with six seconds length and multiple small segments of the signal with the lengths of two, three, and four seconds with different locations were considered for classification. Results indicate that the fusion model does not outperform the maximum individual classifier performance in most cases. The accuracy of 99.29% and 89.02% were obtained for datasets II-III and IV-2b, respectively, by two-layer CNN. The accuracy achieved in this study demonstrates the efficiency of our proposed method in comparison with previous studies on BCI competition II dataset III. Hence, the proposed method can be used in BCI systems to provide reliable communication between paralyzed people and external devices.
Results also indicated that most information of EEG signals is at the beginning EEG samples of MI task, and there is less information at the last EEG samples of MI task.
Considering the single-modal, i.e., EEG, for feature extraction, can limit the performance of the proposed scheme when there are more than two classes. Also, training process of CNN takes long time which is dependent to the structure of CNN. In order to enhance the performance of classification, especially in the case multi-class scenarios, the multimodal scheme, such as combination of functional near-infrared spectroscopy (fNIRS) and EEG can be considered. Also, considering attention-based deep structures can further increase the classification